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Abstract 

Microarrays have become extremely useful for analysing genetic phenomena, but establishing a relation between 
microarray analysis results (typically a list of genes) and their biological significance is often difficult. Currently, 
the standard approach is to map a posteriori the results onto gene networks to elucidate the functions perturbed at 
the level of pathways. However, integrating a priori knowledge of the gene networks could help in the statistical 
analysis of gene expression data and in their biological interpretation. Here we propose a method to integrate a priori 
the knowledge of a gene network in the analysis of gene expression data. The approach is based on the spectral 
decomposition of gene expression profiles with respect to the eigenfunctions of the graph, resulting in an attenuation 
of the high-frequency components of the expression profiles with respect to the topology of the graph. We show how 
to derive unsupervised and supervised classification algorithms of expression profiles, resulting in classifiers with 
biological relevance. We applied the method to the analysis of a set of expression profiles from irradiated and non- 
irradiated yeast strains. It performed at least as well as the usual classification but provides much more biologically 
relevant results and allows a direct biological interpretation. 

Supplementary material: http : / /bioinf o . curie . f r/pro jects/kernelchip 



1 Introduction 



During the last decade microarrays have become the technology of choice for dissecting the genes responsible for a 
phenotype. By monitoring the activity of virtually all the genes from a sample in a single experiment they offer a 
unique perspective for explaining the global genetic picture of a variant, whether a diseased individual or a sample 
subject to whatever stressing conditions. 

However, this strength is also the major weakness of microarrays, and has led to "gene list" syndrome. With care- 
ful experimental design and data analysis, microarrays often present a list of genes that are differentially expressed 
between two conditions, or allow samples to be classified according to their known phenotypic features. Once ob- 
tained, the meaning behind this list of genes, being typically a few hundreds, must be deciphered. Normally, the 
biologist relies on experience for making sense out of the results. MicroaiTays are prone to certain eiTors, and a single 
gene showing differential expression may not necessarily be involved in the process under study. More fundamentally. 
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when it is clear that a gene is involved, the biologist is often more interested in the biological function than in the 
single gene. The behaviour of a biological function usually involves more than just the expression of a single gene, but 
involves the entire set of genes that form the regulation pathways governing the function, which have to be examined. 

One of the aims of microarray data analysis is to discover co-operating genes that are similarly af fected during 
the experiment. Many databases and tools help verify this a posteriori. For example. Gene Ontology fConsortiuml 
I2OOO], Biocarta (www.biocarta.com), GenMAPP (www.genmapp.org) and KEGG |Kanehisa et al., 2004 1 all allow a 
list of genes to be crossed wit h genetic netvyorks, i ncluding metabolic, signalling or other regulation pathways. Basic 
statistical analysis (see, e. g., iHosack et al.L l2003l ) can then determine whether a pathway is over-represented in the 
list, and whether it is over-activated or under-activated. However, it is obvious that introducing information on the 
pathway at this point in the analysis process sacrifices some statistical power to the simplicity of the approach. For 
example, if we assume that genes in the same pathway show a coherent expression pattern, assessing differential gene 
expression could be made more efficient; coherent expression patterns in a regulation pathway would be defined as 
those in which over-expressed inhibitors or activators result in respectively under-expressed or overexpressed targets, 
or vice versa. Thus, a small but coherent difference in the expression of all the genes in a pathway should be more 
significant than larger but random differences. 

There is therefore a pressing need for methods integrating a priori pathway knowledge in the gene expression 
analysis process, and several attempts have been carried out in that direction so far. An initial approach is to derive 
a model for gene expression from an a priori model of ge ne networks; for ex ample, by deriving constraints on co- 
regulated gene expression. Logical discrete formalism | Thom as and Kaufmanl I2OO1I1 can be used to analyse all the 
possible steady states of a biochemical reaction network described by positive and negative influences and can deter- 
mine whether the observed gene expression may be explained by a perturbation of the network. If only the signs of the 
concentration difference s between two steady st ates are considered, it is possible to solve the corresponding Laplace 
equation in sign algebra I Radulescu et al.l*2005'l. giving qualitative predictions for the signs of the concentration dif- 
ferences measured by microarrays. Other approache s, such as the M etaReg formalism |Gat-Viks et al., 2004], and 
boolean and Bayesian networks I Akutsu et all l200d iFriedman et al.[ I2OOOI1 have also been used to predict possible 
gene expression patterns from the network structure, although these approaches adhere less to the formal theory of 
biochemical reaction networks. 

Unfortunately, methods based on network models are rarely satisfactory because detailed quantitative knowledge 
of the complete reaction network parameters is often lacking, or there are only fragments of the network structure 
available. In these cases, more phenomenological approaches need to be used. Pathway scoring methods try to detect 
perturbated "modules" or network pathway s while ignoring the detailed network topology (for recent reviews see 
ICurtis et'all 1200^ ICavalieri and De Filippol l2005l) . It is assumed that the genes inside a module are co-ordinately 
expressed, and thus a perturbation is likely to affect many of them. 

With availa ble databases containin g tens of thousand s reactions and interactions (KEGG iKanehisa et al.L |2004tl, 
TransPath LKruU et all l200(^. BioCyc I Karp et al.Lf2005 ll. Reactome I.Toshi-Tope et all l2005l| and others), the problem 
is how to integrate the detailed graph of gene interactions (and not just crude characteristics such as the inter/intra- 
module connecti vity) into the core microa rray data analysis. Some promising results have been reported with regard 
to this problem. IVert and Kanehisa' f2003>l developed a method for correlating interaction graphs and different types 
of quantitative data, and Rahnenfuhrer et al. |2004] showed that explicitly taking the pathwa y distance between pairs 
of genes into account enhances the statistical scores when identifying activated pathways. Hanisch et al. |2002] re- 
ported the co-clustering of gene expression and gene networks, and the PATIKA project LBabur et aU .2004 1 proposed 
quantifying the compatibility of a pathway with a given microarray data by scoring individual interactions. 

In this paper, we investigate a different approach for integrating genetic networks early in the gene expression 
analysis. We propose a method for calculating the eigen modes of the response of the gene network to a perturbation 
and suggest how these can be introduced into supervised and unsupervised microaiTay data analysis. The method 
uses a hierarchical approach to automatically divide the network into modules having co-ordinated responses. We 
are able to filter out high-frequency noisy modes, thus increasing our ability to interpret expression profiles from 
the underlying genetic network. We illustrate the relevance of our approach by analysing a gene expression d ataset 
that monitors the transcriptional response of irradiated and non-irradiated yeast colonies I Mercier et all E0O4I1 . We 
show that by filtering out 80% of the eigen modes of the KEGG metabolic network in the gene expression, we obtain 
accurate and interpretable discriminative model that may lead to new biological insights. 
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2 Methods 



In this section, we will explain how a gene expression can be decomposed into the eigen modes of a gene network, 
and how to derive unsupervised and supervised classification algorithms from this decomposition. 



2.1 Spectral decomposition of gene expression profiles 

We consider a finite set of genes V of cardinality \ V\ — n. The available gene network is represented by an undirected 
graph G — (V, E) without loop and multiple edges, in which the set of vertices V is the set of genes and E C V x V 
is the list of edges. We will use the notation u ^ v to indicate that two genes u and v are neighbors in the graph, 
that is, (u, v) e E. For any gene u, we denote the degree of u in the graph by d„, that is, its neighbour number 
Gene expression profiling gives a value of expression /(u) for each gene u, and is therefore represented by a function 

f:V^R. 

The Laplacian of the graph G is the n x n matrix iChungiri997ll : 

{d„ if u = V , 
— 1 if u ~ u , (1) 
otherwise . 

The Laplacian is a central concept in spectral graph theory llMohaiiri997ll and shares many properties with the Laplace 
operator on Riemannian manifolds. L is known to be symmetric positive definite and singular We denote its eigenval- 
ues by = Ai < . . . < A„ and the corresponding eigenvectors by ei, . . . , e„. The multiplicity of as an eigenvalue 
is equal to the number of connected components of the graph, and the corresponding eigenvectors are constant on 
each connected component. The eigen-basis of L form s a Fourier ba sis and a natural theory for Fourier analysis and 
spectral decomposition on graphs can thus be derived fChung*, 'l997]. Essentially, the eigenvectors with increasing 
eigenvalues tend to vary more abruptly on the graph, as the smoothest functions (constant on each connected compo- 
nent) are associated with the smallest (zero) eigenvalue. In particular, the Fourier transform / G M" of any expression 
profile / is defined by: 

/, = ^ e,(u)/(u), i = l,...,n. 

Like its continuous counterpart, the discrete Fourier transform can be used for smoothing or for extracting features. 
Here, our hypothesis is that analysing a gene expression profile from its Fourier transform on an a priori given gene 
network should allow the profile to be further analysed through this prior knowledge. In the next two sections we 
illustrate the potential applications of this approach by describing how this leads to a natural definition for distances 
between expression profiles, and how this distance can be used for classification or regression purposes. 



2.2 Deriving a metric for expression profiles 

The definition of new metrics on expression profiles that incorporate information encoded in the graph structure is 
a first possible application of the spectral decomposition. Following the classical methodology in Fourier analysis, 
we assume that the signal captured in the low-frequency component of the expression profiles contains the most 
biologically relevant information, particularly the general expression trends, whereas the high-frequency components 
are more likely measurement noise. For example, the low frequency component of an expression vector on the gene 
metabolic network should reveal areas of positive and negative expression on the graph that are likely to correspond to 
the activation or inhibition of pathways. We can translate this idea mathematically by considering the following class 
of transformations for expression profiles: 

n 

V/eR^, ^4/)=^/,</)(A,)e., (2) 

i=l 

where 4> ■ R^ ^ R is a non-increasing function that quantifies how each frequency is attenuated. For example, if we 
take (j){X) = 1 for all A, the profile does not change that is, S^{f) = /. However, if we take: 

0..es(A) ifO<A<Ao, 
^ ^ ^ lo if A> Ao, 
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we produce a low-pass filter that removes all frequencies from / above the threshold Aq. Finally, a function of the 
form: 

0exp(A) = exp(-/3A) , (4) 

for some /3 > 0, keeps all the frequencies but strongly attenuates the high-frequency components. 

If S^{f) includes the biologically relevant part of the expression profile, we can compare two expression profiles 
/ and g through their representations S^{f) and S^{g). This leads to the following metric between the profiles: 



d^if,gf^\\SM)~S^i9)f 

4=1 

The associated norm is the Euclidean norm associated with the inner product: 

n 

n 
i=l 

= fK^g, 

where Kfjj = ^i^J is the positive semidefinite matrix obtained by modifying the eigenvalues of L through 

(j). For example, taking </)(A) = exp {—(3X) leads to K^j, — ex^M where expj^j denotes the matrix exponential. 



2.3 Supervised learning and regression 

The construction of predictive models for a property of interest from the gene expression profiles of the studied 
samples is a second possible application of the spectral decomposition of expression profiles on the gene network. 
Typical applications include predicting cancer diagnosis or prognosis from gene expression data, or discriminating 
between different treatments applied to micro-organisms. Most approaches build predictive models from the gene 
expression alone, and then check whether the predictive model is biologically relevant by studying, for example, 
whether genes with high weights are located in similar pathways. However the genes often give no clear biological 
meaning. Here, we propose a method combining both steps in a single predictive model that is trained by forcing some 
form of biological relevance. 

We use linear predictive models to predict a variable of interest y from an expression profile / that are obtained by 
solving the following optimisation problem: 

p 

mmy2liw'^f,,y,) + C\\w\\\ (5) 

i=l 

where (/i, yi), . . . , (/p, yp) is a training set of profiles containing the variable y to be predicted, and I is a loss function 
that m easures the cost of predicting fi instead of yi. For example, the popular support vector machine [Boser et all 
Il992ll is a particular case of equation (|5} in which y can take values in —1, +1 and l(u, y) = max(0, 1 — yu) is the 
hinge loss function; ridge regression is obtained for y G M by taking l{u, y) = [u — yY fHastie et al.','200l'|. 

Here, we do not apply algorithms of the form (|3 directly to the expression profiles /, but to their images S^{f). 
That is, we consider the problem: 

p 

mmy^liw'^S4f,),y,) + C\\w\\\ (6) 
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, 1 /2 

Indeed, for any w E M , let v = KJ w. We observe that for any / G M^: 



i=l 



n 



rT 7^1/2 



showing that the final predictor obtained by minimizing ^ is equal to f. We also note that: 



\\w\ 



— WW 



= V 




n 



E 



,2 ' 



i=l 



where the last equality remains valid if is not invertible simply by not including in the sum the terms i for which 
4>{Xi) = 0. This shows that (|5Jl is the equivalent of solving the following problem in the original space; 



Thus, the resulting algorithm amounts to finding a linear predictor v that minimizes the loss function of interest I reg- 
ularised by a terms that penalises the high-frequency components of v. This is different to the classical regularisation 
II V IP that only focuses on the norm of v. As a result, the linear predictor v can be made smoother on the gene net- 
work by increasing the parameter C. This allows the prior knowledge to be direcly included because genes in similar 
pathways would be expected to contribute similarly to the predictive model. 

There are two consequences of this procedure. First, if the true predictor really is smooth on the graph, the 
formulation Q can help the algorithm focus on plausible models even with very little training data, resulting in a 
better estimation. As a result, we can expect a better predictive performance. Second, by forcing the predictive model 
V to be smooth on the graph, biological interpretation of the model should be easier by inspecting the areas of the 
graph in which the predictor is strongly positive or negative. Thus the model should be easier to interpret than models 
resulting from the direct optimisation of equation (|5}. 



We collected the expression data from a study analysing the effect of low irradiation doses on Saccharomyces cere- 
visiae strains iMercier et all 12004 1 . The first group of extracted expression profiles was a set of twelve independent 
yeast cultures grown without radiation (not irradiated, NI). From this group, we excluded an outlier that was indicated 
to us by the author of the article. The second group was a set of six independent irradiated (I) cultures exposed to a 
dose of 15-20 mGu/h for 20h. This dose produces no mutagenic effects, but induces transcrip tional changes. We u sed 
the same normalisation method as in the first study of this data (Splus LOWESS function, see lMercier et al .'. 2004 for 
details) and attempted (1) to separate the NI samples from the I ones, and (2) to understand the difference between the 
two populations in terms of metabolic pathways. 

The gene network m odel used to analyse t he gene expression data was therefore built from the KEGG database 
of metabolic pathways LKanehisa et al.L Eo04ll . The metabolic gene network is a graph in which the enzymes are 
vertices and the edges between two enzymes indicate that the product of a reaction catalysed by the first enzyme is the 
substrate of the reaction catalysed by the second enzyme. We reconstructed this network from the KGML vO.3 version 
of KEGG, resulting in 4694 edges between 737 genes. We kept only the largest connected component (containing 713 
genes) for further spectral analysis. 




(7) 
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4 Results 



4.1 Unsupervised classification 

First, we tested the general effect of modifying the distances between expression profiles using the KEGG metabolic 
pathways as background information in an unsupervised setting. We calculated the pairwise distances between all 17 
expression profiles after applying the transformations defined by the filters and @, over a wide range of parameters. 
We assessed whether the resulting distances were more coherent with a biological intepretation by calculating the ratio 
of intraclass distances over all pairwise distances, defined by: 

+En2,t,2gV2^("2,-^2)^ 

where Vi and V2 are the two classes of points. We compared the results with those obtained by replacing KEGG 
with a random network, produced by keeping the same graph structure but randomly permutating the vertices, in 
order to assess the significance of the results. We generated 100 such networks to give an average result with a 
standard deviation. Figure ^ shows the result for the funtion 0cxp(A) = exp(— /3A) with varying (3 (left), and for 
the function 0thres(A) = 1(A < Aq) for varying Ao (right). We found that, apart from very small values of f3, the 
change of metric with the 0cxp function performed worse than that of a random network. The second method (filtering 
out the high frequency components of the gene expression vector), in which up to 80% of the eigenvectors were 
removed, performed significantly better than that of a random network. When only the top 3% of the smoothest 
eigenvectors were kept, the performance was similar to that of a random network, and when only the top 1% was kept, 
the performance was significantly worse. This explains the disappointing results obtained with the (/fgxp function: by 
giving more weight to the small eigenvalues exponentially, the method focuses on those first few eigenvectors that, as 
shown by the second method, do not provide a geometry compatible with the separation of samples into two classes. 
From the second plot, we can infer that at least 20% of the KEGG eigenvectors should be given sufficient weight to 
obtain a geometry compatible with the classification of the data in this case. 

4.2 PCA analysis 

We carried out a Principal Component Analysis (PCA, I.Tolliffeill996l) on the original expression vectors / and com- 
pared this with a PCA of the transformed set of vectors S^{f) obtained with the function i/ithres to further investigate 
the effect on their relative positions of filtering out the high frequencies of the expression profiles. 

Analysis of the initial sample distribution (see figEJ showed that the first principal component can partially separate 
irradiated from non-irradiated samples, with exception of the two irradiated samples "H" and "12". They have larger 
projections onto the third principal component than onto the first one. The experimental protocol revealed that these 
two samples were affected by higher doses of radiation than the four other samples. 

Gene Ontology analysis of the genes that gave the biggest contribution in the first principal component showed that 
"pyruvate metabolism", "glucose metabolism", "carbohydrate metabolism", and "ergosterol biosynthesis" ontologies 
(here we list only independent ontologies) were over-represented (with p-values less than 10^^"). The second compo- 
nent was associated with "trehalose biosynthesis", and "carboxylic acid metabolism" ontologies and the third principal 
component was associated with the KEGG glycolysis pathway. The first three principal components collected 25%, 
17% and 1 1% of the total dispersion. 

The transformation resulting from a step-like attenuation of eigenvalues fibres removing 80% of the largest 
eigenvalues significantly changed the global layout of data (fig. |2l right) but generally preserved the local neighbour- 
hood relationships. The first three principal components collected 28%, 20% and 12% of the total dispersion, which 
was only slightly higher than the PCA plot of the initial profiles. The general tendency was that the non-irradiated 
normal samples were more closely grouped, which explains the lower intraclass distance values shown in figQ The 
principal components in this case allowed them to be associated with gene ontologies with higher confidence (for 
the first component, the p-values are less than 10^^^). This is a direct consequence of the fact that the principal 
components are constrained to belong to a subspace of smooth functions on KEGG, giving coherence in terms of 
pathways to the genes contributing to the components. The first component gave "DNA-directed RNA polymerase 
activity", "RNA polymerase complex" and "protein kinase activity". Figl^shows that these are the most connected 
clusters of the whole KEGG network. The second component is associated with "purine ribonucleotide metabolism", 
"RNA polymerase complex", "carboxylic acid metabolism" and "acetyl-CoA metabolism" ontologies and also with 
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Figure 1: Performance of the unsupervised classification after changing the metric with the function 0(A) = 
exp(— /5A) for different values of f3 (left), or with the function (j){X) = 1(A < Ao) which varying Aq, that is, by 
keeping a variable number of smallest eigenvalues (right).. The red curve is obtained with the KEGG network. The 
black curves show the result (mean and one standard deviation interval) obtained with a random network. 

"Glycolysis/Gluconeogenesis", "Citrate cycle (TCA cycle)" and "Reductive carboxylate cycle (C02 fixation)" KEGG 
pathways. The third component was associated with "prenyltransferase activity", "lyase activity" and "aspartate family 
amino acid metabolism" ontologies and with "N-Glycan biosynthesis", "Glycerophospholipid metabolism", "Alanine 
and aspartate metabolism" and "riboflavin metabolism" KEGG pathways. Thus, the PCA components of the trans- 
formed expression profiles were affected both by network features and by the microarray data. 

4.3 Supervised classification 

We tested the performance of supervised classification after modifying the distances with a support vector machine 
(SVM) trained to discriminate irradiated samples from non-irradiated samples. For each change of metric, we es- 
timated the performance of the SVM from the total number of misclassifications and the total hinge loss using a 
"leave-one-out" (LOO) approach. This approach removes each sample in turn, trains a classifier on the remaining 
samples and then tests the resulting classifier on the removed sample. For each fold, the regularization parameter was 
selected from the training set only by minimising the classification error estimated with an internal LOO experiment. 
The calculations were carried out using the svmpath package in the R computing environment. 

Figure |3] shows the classification results for the two high frequency attenuation functions 0oxp and 0thies with 
varying parameters. The baseline LOO error was 2 misclassifications for the SVM in the original Euclidean space. For 
the exponential variant {4>cxp{X) — exp(— /3A)), we observed an irregular but certain degradation in performance for 
positive f3 for both the hinge loss and the misclassification number. This is consistent with the result shown in fig[2in 
which the change of metric towards the first few eigenvectors does not give a geometry coherent with the classification 
of samples into irradiated and non-irradiated, resulting in a poorer performance in supervised classification as well. 
For the second variant, in which the expression profiles were projected onto the eigenvector of the graph with the 
smallest eigenvalues, we found that the performance remained as accurate as the baseline performance until up to 
80% of the eigenvectors were discarded, with the hinge loss even exhibiting a slight minimum in this region. This is 
consistent with the classes being more clustered in this case than in the original Euclidean space. Overall these results 
show that classification accuracy can be kept high even when the classifier is constrained to exhibit a certain coherence 
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Figure 2: PCA plots of the initial expression profiles (a) and the transformed profiles using network topology (80% 
of the eigenvalues removed) (b). The green squares are non-irradiated samples and the red rhombuses are uradiated 
samples. Individual sample labels are shown together with GO and KEGG annotations associated with each principal 
component. 

with the graph structure. 

4.4 Interpretation of the SVM classifier 

Figure dH shows the global connection map of KEGG generated from the connection matrix by Cytoscape software 
iShannonet al.ll2003ll . The coefficients of the decision function v of equation Q for the classifier constructed either in 
the original Euclidean space or after filtering the 80% top spectral components of the expression profiles are shown in 
colour. We used a color scale from green (negative weights) to red (positive weights) to provide an easy visualization 
of the main features of the classifiers. Both classifiers give the same classification error but the classifier constructed 
using the network structure can be more naturally interpreted, because the classifier variables are grouped according 
to their participation in the network modules. 

Although from a biological point of view, very little can be learned from the classifier obtained in the original 
Euclidian space (figure|4] left), it is indeed possible to distinguish several features of interest for the classifier obtained 
in the second case (figure 0] right). First, oxidative phosphorylation was found among the pathways with the most 
positive weight s, which is consistent with previous analyses showing that this pathway tended to be up-regulated after 
irradiation LMercier et all .2004,1 . An important cluster involving the DNA and RNA polymerases is also found to bear 
weights sUghtly above average in these experiments. Several stud ies have previously r eported the induction of genes 
involved in replication and repair after high doses of irradiation [ Mercier et all EooHl . but the detection of such an 
induction at the low irradiation doses used in the present biological experiments is rather interesting. The strongly 
negative landscape of weights in the protein kinases cluster has not been seen before and may lead to a new area of 
biological study. Most of the kinases are involved in signalling pathways, and therefore their low expression levels 
may have important biological consequences. 

Figure 0] shows a highlighted pathway named "GlycoIysis/GIuconeogenesis" in KEGG. A more detailed view of 
this pathway is shown in fig|5l This pathway contains enzymes that are also used in many other KEGG pathways and 
is therefore situated in the middle and most entangled part of the global network. As akeady mentioned, this pathway 
is associated with the first and the third principal components of the initial dataset. The pathway actually contains two 
alternative sub-pathways that are affected differentially. Over-expression in the gluceogenesis pathway seems to be 
characteristic of irradiated samples, whereas glyceolyses has a low level of expression in that case .This shift can be 
observed by changing from anaerobic to aerobic growth conditions (called diauxic shift). The reconstruction of this 
from our data with no prior input of this knowledge strongly confirms the relevance of our analysis method. It also 
shows that analysing expression in terms of the global up- or down-regulation of entire pathways as defined, for ex- 
ample, by KEGG, could mislead as there are many antagonist processes that take place inside pathways. Representing 
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Figure 3: Performance of the supervised classification when changing the metric with the function 0oxp(A) = 
exp(— /5A) for different values of (3 (left picture), or the function 0thres(A) = 1(A < Ao) for different values of Ao 
(i.e., keeping only a fraction of the smallest eigenvalues, right picture). The performance is estimated from the number 
of misclassifications in a leave-one-out error. 

KEGG as a large network helps keep the biochemical relationships between genes without the constraints of pathway 
hmits. 

5 Discussion 

Our algorithm groups predictor variables according to highly connected "modules" of the global gene network. We 
assume that the genes within a tightly connected network module are likely to contribute similarly to the prediction 
function because of the interactions between the genes. This motivates the filtering of gene expression profile to 
remove the noisy high-frequency modes of the network. 

Such grouping of variables is a very useful feature of the resulting classification function because the function 
becomes meaningful for interpreting and suggesting biological factors that cause the class separation. This allows 
classifications based on functions, pathways and network modules rather than on individual genes. This can lead to a 
more robust behaviour of the classifier in independent tests and to equal if not better classification results. Our results 
on the dataset we analysed shows only a slight improvement, although this may be due to its limited size. Therefore 
we are currently extending our work to larger data sets. 

An important remark to bear in mind when analyzing pictures such as fig|3 and |5l is that the colors represent 
the weights of the classifier, and not gene expression levels. There is of course a relationship between the classifier 
weights and the typical expression levels of genes in irradiated and non-irradiated samples: irradiated samples tend 
to have expression profiles positively correlated with the classifier, while non-irradiated samples tend to be negatively 
correlated. Roughly speaking, the classifier tries to find a smooth function that has this property. If more samples 
were available, better non-smooth classifier might be learned by the algorithm, but constraining the smoothness of the 
classifier is a way to reduce the complexity of the learning problem when a limited number of samples are available. 
This means in particular that the pictures provide virtually no information regarding the over- or under-expression 
of individual genes, which is the cost to pay to obtain instead an interpretation in terms of more global pathways. 
Constraining the classifier to rely on just a few genes would have a similar effect of reducing the complexity of the 
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Figure 4: Global connection map of KEGG with mapped coefficients of the decision function obtained by applying 
a customary linear SVM (left) and using high-frequency eigenvalue attenuation (80% of high-frequency eigenvalues 
have been removed) (right). Spectral filtering divided the whole network into modules having coordinated responses, 
with the activation of low-frequency eigen modes being determined by microarray data. Positive coefficients are 
marked in red, negative coefficients are in green, and the intensity of the colour reflects the absolute values of the 
coefficients. Rhombuses highlight proteins participating in the Glycolysis/Gluconeogenesis KEGG pathway. Some 
other parts of the network are annotated including big highly connected clusters corresponding to protein kinases and 
DNA and RNA polymerase sub-units. 
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Figure 5: The glycolysis/gluconeogenesis pathways of KEGG with mapped coefficients of the decision function ob- 
tained by applying a customary linear SVM (a) and using high-frequency eigenvalue attenuation (b). The pathways 
are mutually exclusive in a cell, as clearly highlighted by our algorithm. 
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problem, but would lead to a more difficult interpretation in terms of pathways. 

An advantage of our approach over other pathway-based clustering methods is that we consider the network mod- 
ules that naturally appear from spectral analysis rather than a historically defined separation of the network into path- 
ways. Thus, pathways cross-talking is taken into account, which is difficult to do using other approaches. It can 
however be noticed that the implicit decomposition into pathways that we obtain is biased by the very incomplete 
knowledge of the network and that certain regions of the network are better understood, leading to a higher connection 
concentration. 

Like most approaches aiming at comparing expression data with gene networks such as KEGG, the scope of 
this work is limited by two important constraints. First the gene network we use is only a convenient but rough 
approximation to describe complex biochemical processes; second, the transcriptional analysis of a sample can not 
give any information regarding post-transcriptional regulation and modifications. Nevertheless, we believe that our 
basic assumptions remain valid, in that we assume that the expression of the genes belonging to the same metabolic 
pathways module are coordinately regulated. Our interpretation of the results supports this assumption. 

Another important caveat is that we simplify the network description as an undirected graph of interactions. Al- 
though this would seem to be relevant for simplifying the description of metabolic networks, real gene regulation net- 
works are influenced by the direction, sign and importance of the interaction. Although the incorporation of weights 
into the Laplacian (equation Q is straightforward and allows the extension of the approach to weighted undirected 
graphs, the incorporation of directions and signs to represent signalling or regulatory pathways requires more work 
but could lead to important advances for the interpretation of microarray data in cancer studies, for example. 
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